Rationale - the Data lake

  • The GRINS foundation aims at implementing a data platform for the transfer of knowledge and statistical analysis (AMELIA)

  • Raw material of the platform: the data lake

    • Broad repository hosting several categories of administrative data from different sources
    • Available to private, corporate or academic users
    • Data organised at the territorial level of municipalities (LAU/NUTS-4)
  • The present R package is intended as one of several contributions to the data lake

Rationale - the Data lake

  • This R package covers the dimension of public education, with special regard to the territorial structure of the education system.

  • Main utility: analysing territorial disparities in education quality and school infrastructure endowment

  • Directly supports areal modelling

Principles followed

  • Accessibility: All data must be publicly accessible and easy to handle for the generic user
    • Input data are open and come from publicly accessible web pages
  • Updating: All information is retrieved in real time so that it is always up to date
    • Inputs are scraped from the web rather than stored in a built-in repository
  • Portability: All objects should be easy to export and process with different software:
    • We work in the tidyverse framework, and all outputs are structured as tibbles
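
As a minimal illustration of the portability principle, the sketch below builds a small tibble and exports it to a CSV file that other software can read. The column names (`Municipality_code`, `Students`) and all values are invented for the example; they are not actual package outputs.

```r
library(tibble)

# Toy tibble; column names and values are illustrative assumptions
toy <- tibble(
  Municipality_code = c("001001", "001002", "001003"),
  Students = c(1200L, 850L, 430L)
)

# Export to a plain CSV file (a temporary file here)
out_file <- tempfile(fileext = ".csv")
write.csv(toy, out_file, row.names = FALSE)

# Read it back; codes are read as character so that leading zeros survive
toy_back <- read.csv(out_file, colClasses = c("character", "integer"))
```

Reading territorial codes as character strings rather than numbers is a deliberate choice in this sketch, since codes with leading zeros would otherwise be altered on import.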

Main function modules

  • Get_: input data scraping. Information is not altered, and the user receives a data set as close as possible to the one released by the provider

  • Util_: utilities; mainly data modification and editing

  • Group_: data aggregation at the relevant territorial level

    • NUTS-3/Province
    • LAU/Municipality
  • Map_: displaying

    • Static maps (vector format): easy to export
    • Interactive maps: preserve information on different variables
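
The aggregation step performed by the Group_ module can be pictured with a toy example. The sketch below is not the package implementation: it only shows, with dplyr, how school-level records could be summarised at the LAU and NUTS-3 levels. All codes, column names and values are invented.

```r
library(dplyr)

# Toy school-level data; codes, names and values are illustrative assumptions
schools <- tibble::tibble(
  School_code       = c("S1", "S2", "S3", "S4"),
  Municipality_code = c("M1", "M1", "M2", "M2"),
  Province_code     = c("P1", "P1", "P1", "P1"),
  Students          = c(100, 300, 50, 150)
)

# LAU/municipality level: one row per municipality
by_municipality <- schools %>%
  group_by(Province_code, Municipality_code) %>%
  summarise(n_schools = n(), Students = sum(Students), .groups = "drop")

# NUTS-3/province level: one row per province
by_province <- schools %>%
  group_by(Province_code) %>%
  summarise(n_schools = n(), Students = sum(Students), .groups = "drop")
```

Both outputs are tibbles, so they can be joined to a shapefile by territorial code before mapping.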

Main datasets

  • Data from the Ministry of Education
    • Includes:
      • National Schools Registry
      • School Buildings database
      • Students and teachers counts
    • Mainly available at the school level (except for the count of teachers)
  • Ultra-Broadband implementation
    • Available at the school level
  • Invalsi census survey
    • Available at the NUTS-3 / LAU level

Schools Taxonomy

  • Schools ID - mechanographical codes
    • Most complete list: National Schools Registry
    • Identifies both school order and address (of high schools)
  • School buildings ID - typically numeric codes
    • Only included in the School buildings DB

School buildings database

  • Main source of information regarding the school infrastructure

  • Mostly includes categorical variables, regarding several aspects such as:

    • Environmental context
    • Reachability by public or private transport
    • Building period
    • Surfaces and volumes
  • As an example, the next slides display the surface area of middle schools (on a logarithmic scale, to ease comparison) for the three regions of Apulia, Basilicata and Calabria.
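
To see why a logarithmic scale eases comparison, consider a toy vector of strongly right-skewed surface values. The figures below are invented, and the unit (square metres) is an assumption made for the example.

```r
# Invented surface values (assumed to be in square metres), right-skewed
surface <- c(400, 900, 2500, 60000)

# On the raw scale the largest value is 150 times the smallest,
# so it dominates any linear colour scale
ratio_raw <- max(surface) / min(surface)

# On the log scale the same spread becomes log(150), roughly 5 units,
# and differences among the smaller values remain visible
log_surface <- log(surface)
spread_log <- max(log_surface) - min(log_surface)
```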

School buildings database

Functions:

  • Get_DB_MIUR(): scrape the raw data and return the school-buildings level database
  • Util_DB_MIUR_num(): convert raw data to numeric and edit if required
  • Group_DB_MIUR(): harmonise at the territorial level
  • Map_School_Buildings(): render the map

Input_DB23_MIUR <- Get_DB_MIUR(Year = 2023, 
                               input_Registry = Registry23) 
# 2022/23 is the latest year available
# then, remember to add message = FALSE

# Aggregate at the municipality level and add the log of the surface area
DB23_MIUR_mun <- Group_DB_MIUR(Input_DB23_MIUR,
                               InnerAreas = FALSE)$Municipality_data %>%
  dplyr::mutate(log_Surface = log(.data$School_area_surface))

head(DB23_MIUR_mun)

School buildings database

DB23_MIUR_mun %>% 
  Map_School_Buildings(input_shp = Mun22_shp, field = "log_Surface", 
                       level = "LAU", order = "Middle",
                       region_code = c(16, 17, 18), verbose = FALSE)

Invalsi census survey

  • Aggregate measure of students' skills, expressed as the territorial average of either:
    • Percentage of sufficient tests (primary schools only)
    • Ability \(A_i\) of the \(i\)-th student to correctly answer question \(Q_j\) of difficulty \(D_j\), based on the model \[\mathrm{Prob} \lbrace Q_{ij} = 1 \rbrace = \frac{e^{A_i - D_j}}{1 + e^{A_i - D_j}}\]
  • Spatially homogeneous indicator
  • Three variables:
    • M_: mean
    • S_: standard deviation
    • C_: coverage
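
The logistic model above (the form usually known as the Rasch model) can be evaluated directly in base R. The sketch below is only an illustration of the formula: the ability and difficulty values are invented and are not Invalsi estimates.

```r
# Probability that a student of ability A correctly answers
# a question of difficulty D: exp(A - D) / (1 + exp(A - D))
rasch_prob <- function(A, D) {
  plogis(A - D)  # logistic function from base R (stats package)
}

# Illustrative values (assumptions, not Invalsi estimates)
abilities    <- c(-1, 0, 1)
difficulties <- c(-0.5, 0, 0.5)

# Matrix of probabilities: rows = students, columns = questions
P <- outer(abilities, difficulties, rasch_prob)
```

When ability equals difficulty the probability of a correct answer is exactly 0.5; it increases with ability and decreases with difficulty.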

Invalsi census survey

Example: Mathematics score for the last year of high school, year 2022/23, province level:

Map_Invalsi(input_shp = Prov22_shp,
            grade = 13, subj = "MAT", level = "NUTS-3", Year = 2023)
## Retrieving Invalsi census data for provinces
## Encoding raw content in UTF-8 
## Total running time to retrieve Invalsi NUTS-3 data: 4.21 seconds 
## Total running time to  process Invalsi NUTS-3 data for year 202223 subject English_R, English_L, Italian, Mathematics School year n. 13 : 
## 0.1 seconds
## Warning: Found less unique colors (106) than unique zcol values (107)! 
## Interpolating color vector to match number of zcol values.